Optimizing data management for MapReduce applications on large-scale distributed infrastructures

Authors

  • Diana Maria Moise
  • Frédéric Desprez
  • Gabriel Antoniu
Abstract

…ions were developed based on MapReduce, with the goal of providing a simple-to-use interface for expressing database-like queries [64, 6]. Bioinformatics is one of the numerous research domains that employ MapReduce to model their algorithms [69, 58, 56]. As an example, CloudBurst [69] is a MapReduce-based algorithm for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses. Other research areas where MapReduce is widely employed include astronomy [76], social networks [26], artificial intelligence [39, 60], image and video processing, simulations, etc.

2.2 Hadoop-based applications

2.2.1 The Hadoop project

The Hadoop project [15] was founded by Yahoo! in 2006. It started out as an open-source implementation of the MapReduce model promoted by Google. By 2008, Hadoop was being used by large companies other than Yahoo!, such as Last.fm and Facebook. A notable Hadoop use case belongs to the New York Times: in 2007, the company rented resources on Amazon's EC2 in order to convert the newspaper's scanned archives to PDF files. The data reached 4 TB in size and the computation took less than 24 hours on 100 machines; Hadoop was used to run this application in the cloud environment provided by Amazon. In 2008 and again in 2009, Hadoop broke the world record for sorting 1 TB of data. Using Hadoop, Yahoo! managed to sort a terabyte of data in 62 seconds. Currently, the world record is 60 seconds, held by a team from the University of California, San Diego.

The core of the Hadoop project consists of the MapReduce implementation and the Hadoop Distributed File System (HDFS). Over the years, several sub-projects have been developed as part of the Hadoop collection. These sub-projects include frameworks that cover a wide range of distributed computing; they were built either as complements to the Hadoop core or on top of it, with the purpose of providing higher-level abstractions. Currently, the Hadoop project offers the following services:

  • MapReduce: a framework for large-scale data processing.
  • HDFS: a distributed file system for clusters of commodity hardware (a client-side usage sketch follows this list).
  • Avro: a system for efficient, platform-independent data serialization.
  • Pig: a distributed infrastructure for performing high-level analysis on large data sets.
  • HBase: a distributed, column-oriented store for large amounts of structured data, built on top of HDFS.
  • ZooKeeper: a service that enables highly reliable coordination for building distributed applications.
  • Hive: a data-warehouse system that provides data summarization, ad-hoc queries, and analysis of large datasets stored in HDFS.
  • Chukwa: a system for collecting and analyzing data on large-scale platforms; it also includes tools for displaying, monitoring, and analyzing the results, in order to make the best use of the collected data.
  • Cassandra: a distributed database management system, designed to handle very large amounts of data while providing a highly available service with no single point of failure.
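To make the role of HDFS more concrete, here is a minimal sketch of a client that writes a file and reads it back through the Hadoop Java FileSystem API. It is only an illustration under stated assumptions: the namenode address, port, and file path are placeholders, not values taken from the text.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Minimal HDFS client sketch: write a small file, then read it back. */
    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The namenode URI below is a placeholder for an actual cluster address.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
            Path path = new Path("/user/demo/hello.txt");   // hypothetical HDFS path

            // Writing: the client obtains block allocations from the namenode and
            // streams the data to the selected datanodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Reading: the data is streamed back from the datanodes holding the blocks.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }
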
Ever since it was released, Hadoop's popularity has grown rapidly, as a result of features such as performance, simplicity, and low cost. The list of Hadoop users [12] includes companies and institutes that employ one or several Hadoop sub-projects for research or production purposes. Companies such as Adobe and eBay use Hadoop MapReduce, HBase, and Pig for structured data storage and search optimization. Facebook makes use of HDFS and Hive for storing logs and data sources and performing queries on them. Twitter heavily uses Pig, MapReduce, and Cassandra to process all types of data generated across Twitter. Reports from Yahoo! show that Hadoop currently runs on more than 100,000 CPUs; its largest Hadoop cluster comprises 4,500 nodes, with several terabytes of RAM and petabytes of storage. Yahoo! also reports that more than 60% of the Hadoop jobs it runs are Pig jobs.

In addition to its cluster usage, Hadoop is becoming a de facto standard for cloud computing. The generic nature of cloud computing allows resources to be purchased on demand, especially to augment local resources for specific large or time-critical tasks. Several organizations offer cloud compute cycles that can be accessed via Hadoop. Amazon's Elastic Compute Cloud (EC2) contains tens of thousands of virtual machines and supports Hadoop with minimal effort.

2.2.2 The Hadoop MapReduce implementation

The Hadoop project provides an open-source implementation of Google's MapReduce paradigm through the Hadoop MapReduce framework [16, 75]. The framework was designed following Google's architectural model and has become the reference MapReduce implementation. The architecture is tailored in a master-slave manner, consisting of a single master jobtracker and multiple slave tasktrackers.
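To illustrate the user-facing side of this framework, below is a minimal word-count style job written against the Hadoop Java MapReduce API; the class names and the input/output paths passed on the command line are illustrative assumptions. The user supplies only the map and reduce functions, while the jobtracker schedules the resulting map and reduce tasks on the slave tasktrackers.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** Minimal word-count job: map emits (word, 1) pairs, reduce sums the counts. */
    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Each map task processes one input split, line by line.
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // All counts for a given word are routed to the same reduce task.
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");   // Job.getInstance(conf, ...) in newer Hadoop
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Such a job would typically be packaged into a jar and submitted with a command along the lines of "hadoop jar wordcount.jar WordCount <input> <output>", where both paths refer to HDFS locations.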

Similar articles

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The rack-aware data placement strategy used by the current Hadoop and its distributed file system assumes a homogeneous cluster, in which every node has the same computing capacity and is assigned the same workload. Default Hadoop d...

BlobSeer: Next-generation data management for large scale infrastructures

As data volumes increase rapidly in more and more application fields of science, engineering, and information services, the challenges posed by data-intensive computing gain increasing importance. The emergence of highly scalable infrastructures, e.g. for cloud computing and for petascale computing and beyond, introduces additional issues for which scalable data management becomes a...

Energy Efficient Data Intensive Distributed Computing

Many practically important problems involve processing very large data sets, such as web-scale data mining and indexing. An efficient way to handle such problems is to use data-intensive distributed programming paradigms such as MapReduce and Dryad, which allow programmers to easily parallelize the processing of large data sets, where parallelism arises naturally by operating on different ...

Understanding application-level interoperability: Scaling-out MapReduce over high-performance grids and clouds

Application-level interoperability is defined as the ability of an application to utilize multiple distributed heterogeneous resources. Such interoperability is becoming increasingly important with growing volumes of data, multiple sources of data, and diverse resource types. The primary aim of this paper is to understand the different ways and levels in which application-level interoperability ca...

Software Design and Implementation for MapReduce across Distributed Data Centers

Recently, the computational requirements for large-scale data-intensive analysis of scientific data have grown significantly. In High Energy Physics (HEP), for example, the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. This huge amount of data is processed on more than 140 computing centers distributed across 34 countries. The MapReduce paradigm has emerged as a highly succ...

Publication year: 2012